Abstract

Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the
area of speech and handwriting recognition. The performance of an MDRNN is improved by further
increasing its depth, and the difficulty of learning the deeper network is overcome by using
Hessian-free (HF) optimization. Because connectionist temporal classification (CTC) is used as
the objective for training an MDRNN for sequence labeling, the non-convexity of CTC poses a
problem when applying HF to the network. As a solution, a convex approximation of CTC is
formulated, and its relationship with the EM algorithm and the Fisher information matrix is
discussed. An MDRNN with a depth of up to 15 layers is successfully trained using HF, resulting
in improved performance for sequence labeling.


1 Introduction 

Multidimensional recurrent neural networks (MDRNNs) constitute an efficient architecture for
building a multidimensional context into recurrent neural networks. End-to-end training of
MDRNNs in conjunction with connectionist temporal classification (CTC) has been shown to
achieve state-of-the-art performance in on/off-line handwriting and speech recognition [2, 3].

Previous approaches demonstrated the performance of MDRNNs with a depth of up to five layers,
which is limited compared to the recent progress in feedforward networks. The effectiveness of
MDRNNs deeper than five layers has thus far been unknown.

Training a deep architecture has always been a challenging topic in machine learning. A notable
breakthrough was achieved when deep feedforward neural networks were initialized using
layer-wise pre-training. Recently, approaches have been proposed in which supervision is added to
intermediate layers to train deep networks. To the best of our knowledge, no such pre-training
or bootstrapping method has been developed for MDRNNs.

Alternatively, Hessian-free (HF) optimization is an appealing approach to training deep neural
networks because of its ability to overcome pathological curvature of the objective function.
Furthermore, it can be applied to any connectionist model provided that its objective function is
differentiable. The recent success of HF for deep feedforward and recurrent neural networks
supports its application to MDRNNs.

In this paper, we claim that an MDRNN can benefit from a deeper architecture, and that the
application of second-order optimization such as HF allows its successful learning. First, we
offer details of the development of HF optimization for MDRNNs. Then, to apply HF optimization
to sequence labeling tasks, we address the problem of the non-convexity of CTC and formulate a
convex approximation. In addition, its relationship with the EM algorithm and the Fisher
information matrix is discussed. Experimental results for offline handwriting and phoneme
recognition show that an MDRNN with HF optimization performs better as the depth of the network
increases up to 15 layers.
2 Multidimensional recurrent neural networks


MDRNNs constitute a generalization of RNNs to process multidimensional data by replacing the
single recurrent connection with as many connections as the dimensions of the data. The network
can access the contextual information from $2^d$ directions for $d$-dimensional data, allowing a
collective decision to be made based on rich context information. To enhance its ability to
exploit context information, long short-term memory (LSTM) cells are usually utilized as hidden
units. In addition, stacking MDRNNs to construct deeper networks further improves the
performance as the depth increases, achieving state-of-the-art performance in phoneme
recognition [3]. For sequence labeling, CTC is applied as a loss function of the MDRNN. The
important advantage of using CTC is that no pre-segmented sequences are required, and the entire
transcription of the input sample is sufficient.

2.1 Learning MDRNNs 

A $d$-dimensional MDRNN with $M$ inputs and $K$ outputs is regarded as a mapping from an input
sequence $x \in \mathbb{R}^{M \times T_1 \times \cdots \times T_d}$ to an output sequence
$a \in (\mathbb{R}^K)^T$ of length $T$, where the input data for $M$ input neurons are given by
the vectorization of $d$-dimensional data, and $T_1, \ldots, T_d$ is the length of the sequence
in each dimension. All learnable weights and biases are concatenated to obtain a parameter vector
$\theta \in \mathbb{R}^N$. In the learning phase with fixed training data, the MDRNN is
formalized as a mapping $\mathcal{N}: \mathbb{R}^N \to (\mathbb{R}^K)^T$ from the parameters
$\theta$ to the output sequence $a$, i.e., $a = \mathcal{N}(\theta)$.

The scalar loss function is defined over the output sequence as
$\mathcal{L}: (\mathbb{R}^K)^T \to \mathbb{R}$. Learning an MDRNN is viewed as an optimization of
the objective $\mathcal{L}(\mathcal{N}(\theta)) = \mathcal{L} \circ \mathcal{N}(\theta)$ with
respect to $\theta$.

The Jacobian $J_\mathcal{F}$ of a function $\mathcal{F}: \mathbb{R}^m \to \mathbb{R}^n$ is the
$n \times m$ matrix where each element is a partial derivative of an element of the output with
respect to an element of the input. The Hessian $H_\mathcal{F}$ of a scalar function
$\mathcal{F}: \mathbb{R}^m \to \mathbb{R}$ is the $m \times m$ matrix of second-order partial
derivatives of the output with respect to its inputs. Throughout this paper, a vector sequence is
denoted by boldface $\mathbf{a}$, a vector at time $t$ in $\mathbf{a}$ is denoted by $a^t$, and
the $k$-th element of $a^t$ is denoted by $a^t_k$.

3 Hessian-free optimization for MDRNNs 

The application of HF optimization to an MDRNN is straightforward if the matching loss function
for its output layer is adopted. However, this is not the case for CTC, which is necessarily
adopted for sequence labeling. Before developing an appropriate approximation to CTC that is
compatible with HF optimization, we discuss two considerations related to the approximation. The
first is obtaining a quadratic approximation of the loss function, and the second is the
efficient calculation of the matrix-vector product used at each iteration of the conjugate
gradient (CG) method.

HF optimization minimizes an objective by constructing a local quadratic approximation for the
objective function and minimizing the approximate function instead of the original one. The loss
function $\mathcal{L}(\theta)$ needs to be approximated at each point $\theta_n$ of the $n$-th
iteration:

\[ Q_n(\theta) = \mathcal{L}(\theta_n) + \nabla_\theta \mathcal{L}|_{\theta_n}^\top \delta_n + \frac{1}{2} \delta_n^\top G \delta_n, \tag{1} \]

where $\delta_n = \theta - \theta_n$ is the search direction, i.e., the parameters of the
optimization, and $G$ is a local approximation to the curvature of $\mathcal{L}(\theta)$ at
$\theta_n$, which is typically obtained by the generalized Gauss-Newton (GGN) matrix as an
approximation of the Hessian.

HF optimization uses the CG method as a subroutine to minimize the quadratic objective above,
which exploits the complete curvature information while remaining computationally efficient. CG
requires the computation of $Gv$ for an arbitrary vector $v$, but not the explicit evaluation of
$G$. For neural networks, an efficient way to compute $Gv$ has been proposed, extending an
earlier study. In Section 3.2, we provide the details of the efficient computation of $Gv$ for
MDRNNs.
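The role of CG in minimizing Eq. (1) can be illustrated with a minimal sketch. It assumes a toy
curvature matrix $G = A^\top A + \lambda I$ standing in for a damped GGN matrix; the names
`conjugate_gradient`, `matvec`, `A`, and `lam` are illustrative and do not come from the paper.
The point is only that CG touches $G$ through products $Gv$.

```python
import numpy as np

def conjugate_gradient(matvec, b, x0=None, max_iters=50, tol=1e-10):
    """Solve G x = b for a symmetric positive definite G using only products G v."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - matvec(x)            # residual
    p = r.copy()                 # current search direction
    rs_old = r @ r
    for _ in range(max_iters):
        if rs_old < tol:
            break
        Gp = matvec(p)
        alpha = rs_old / (p @ Gp)
        x += alpha * p
        r -= alpha * Gp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Hypothetical toy problem: G = A^T A + lam * I is never formed inside CG.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
lam = 0.1
grad = rng.standard_normal(5)

def matvec(v):
    return A.T @ (A @ v) + lam * v

# Minimizing grad^T d + 0.5 * d^T G d is equivalent to solving G d = -grad.
delta = conjugate_gradient(matvec, -grad)
```

In an HF setting, `matvec` would be replaced by the network-specific product described in
Section 3.2.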

3.1 Quadratic approximation of loss function 

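The Hessian matrix, $H_{\mathcal{L} \circ \mathcal{N}}$, of the objective
$\mathcal{L}(\mathcal{N}(\theta))$ is written as

\[ H_{\mathcal{L} \circ \mathcal{N}} = J_\mathcal{N}^\top H_\mathcal{L} J_\mathcal{N} + \sum_{i=1}^{KT} [J_\mathcal{L}]_i H_{[\mathcal{N}]_i}, \tag{2} \]

where $J_\mathcal{N} \in \mathbb{R}^{KT \times N}$, $H_\mathcal{L} \in \mathbb{R}^{KT \times KT}$,
and $[q]_i$ denotes the $i$-th component of the vector $q$.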

An indefinite Hessian matrix is problematic for second-order optimization, because it defines an
unbounded local quadratic approximation. For nonlinear systems, the Hessian is not necessarily
positive semidefinite, and thus, the GGN matrix is used as an approximation of the Hessian.
The GGN matrix is obtained by ignoring the second term in Eq. (2), as given by

\[ G_{\mathcal{L} \circ \mathcal{N}} = J_\mathcal{N}^\top H_\mathcal{L} J_\mathcal{N}. \tag{3} \]

The sufficient condition for the GGN approximation to be exact is that the network makes a
perfect prediction for every given sample, that is, $J_\mathcal{L} = 0$, or that $[\mathcal{N}]_i$
stays in the linear region for all $i$, that is, $H_{[\mathcal{N}]_i} = 0$.

$G_{\mathcal{L} \circ \mathcal{N}}$ has rank no greater than $KT$ and is positive semidefinite
provided that $H_\mathcal{L}$ is. Thus, $\mathcal{L}$ is chosen to be a convex function so that
$H_\mathcal{L}$ is positive semidefinite. In principle, it is best to define $\mathcal{L}$ and
$\mathcal{N}$ such that $\mathcal{L}$ performs as much of the computation as possible, with the
positive semidefiniteness of $H_\mathcal{L}$ as a minimum requirement. In practice, a nonlinear
output layer together with its matching loss function, such as the softmax function with
cross-entropy loss, is widely used.

3.2 Computation of matrix-vector product for MDRNN 

The product of an arbitrary vector $v$ by the GGN matrix,
$Gv = J_\mathcal{N}^\top H_\mathcal{L} J_\mathcal{N} v$, amounts to the sequential multiplication
of $v$ by three matrices. First, the product $J_\mathcal{N} v$ is a Jacobian times a vector and is
therefore equal to the directional derivative of $\mathcal{N}(\theta)$ along the direction of $v$.
Thus, $J_\mathcal{N} v$ can be written using a differential operator,
$J_\mathcal{N} v = \mathcal{R}_v(\mathcal{N}(\theta))$, and the properties of the operator can be
utilized for efficient computation. Because an MDRNN is a composition of differentiable
components, the computation of $\mathcal{R}_v(\mathcal{N}(\theta))$ throughout the whole network
can be accomplished by repeatedly applying the sum, product, and chain rules starting from the
input layer. The detailed derivation of the $\mathcal{R}$ operator for LSTM, normally used as a
hidden unit in MDRNNs, is provided in appendix A.

Next, the multiplication of $J_\mathcal{N} v$ by $H_\mathcal{L}$ can be performed by direct
computation. The dimension of $H_\mathcal{L}$ could at first appear problematic, since the
dimension of the output vector used by the loss function $\mathcal{L}$ can be as high as $KT$, in
particular, if CTC is adopted as an objective for the MDRNN. If the loss function can be
expressed as the sum of individual loss functions with a domain restricted in time, the
computation can be reduced significantly. For example, with the commonly used cross-entropy loss
function, the $KT \times KT$ matrix $H_\mathcal{L}$ can be transformed into a block diagonal
matrix with $T$ blocks of a $K \times K$ Hessian matrix. Let $H_{\mathcal{L},t}$ be the $t$-th
block in $H_\mathcal{L}$. Then, the GGN matrix can be written as

\[ G_{\mathcal{L} \circ \mathcal{N}} = \sum_t J_{\mathcal{N}_t}^\top H_{\mathcal{L},t} J_{\mathcal{N}_t}, \tag{4} \]

where $J_{\mathcal{N}_t}$ is the Jacobian of the network at time $t$.

Finally, the multiplication of a vector $u = H_\mathcal{L} J_\mathcal{N} v$ by the matrix
$J_\mathcal{N}^\top$ is calculated using the backpropagation through time algorithm by
propagating $u$ instead of the error at the output layer.
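The three-step product of this section can be sketched on a deliberately simple stand-in model.
The sketch assumes a linear map $a^t = W x^t$ followed by softmax with cross-entropy, so that
$J_{\mathcal{N}_t} v$ and $J_{\mathcal{N}_t}^\top u$ have closed forms; an MDRNN would instead use
the $\mathcal{R}$ operator and backpropagation through time. The function name
`ggn_vector_product` and the toy sizes are illustrative assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def ggn_vector_product(W, X, v):
    """Compute G v = sum_t J_t^T H_t J_t v (Eq. 4) for the toy model a^t = W x^t.

    W: (K, M) weights, X: (T, M) inputs, v: flat vector with K*M entries.
    H_t = diag(y^t) - y^t y^t^T is the softmax/cross-entropy Hessian block.
    """
    K, M = W.shape
    V = v.reshape(K, M)
    Gv = np.zeros_like(W)
    for x in X:
        y = softmax(W @ x)
        Jv = V @ x                    # step 1: J_t v, a directional derivative
        u = y * Jv - y * (y @ Jv)     # step 2: H_t (J_t v) without forming H_t
        Gv += np.outer(u, x)          # step 3: J_t^T u, i.e. backpropagating u
    return Gv.ravel()

rng = np.random.default_rng(0)
K, M, T = 3, 4, 5
W = rng.standard_normal((K, M))
X = rng.standard_normal((T, M))
v = rng.standard_normal(K * M)
Gv = ggn_vector_product(W, X, v)
```

The cost per call is linear in $T$ because each time block is handled independently, which is
the benefit of the block diagonal structure.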


4 Convex approximation of CTC for application to HF optimization 

Connectionist temporal classification (CTC) provides an objective function for learning an
MDRNN for sequence labeling. In this section, we derive a convex approximation of CTC inspired
by the GGN approximation according to the following steps. First, the non-convex part of the
original objective is separated out by reformulating the softmax part. Next, the remaining convex
part is approximated without altering its Hessian, making it well matched to the non-convex part.
Finally, the convex approximation is obtained by reuniting the convex and non-convex parts.


4.1 Connectionist temporal classification 


CTC is formulated as the mapping from an output sequence of the recurrent network,
$a \in (\mathbb{R}^K)^T$, to a scalar loss. The output activations at time $t$ are normalized
using the softmax function

\[ y^t_k = \frac{\exp(a^t_k)}{\sum_{k'} \exp(a^t_{k'})}, \tag{5} \]

where $y^t_k$ is the probability of label $k$ given $a$ at time $t$.

The conditional probability of the path $\pi$ is calculated by the multiplication of the label
probabilities at each timestep, as given by

\[ p(\pi|a) = \prod_{t=1}^{T} y^t_{\pi_t}, \tag{6} \]

where $\pi_t$ is the label observed at time $t$ along the path $\pi$. The path $\pi$ of length
$T$ is mapped to a label sequence of length $M \le T$ by an operator $\mathcal{B}$, which removes
the repeated labels and then the blanks. Several mutually exclusive paths can map to the same
label sequence. Let $S$ be a set containing every possible sequence mapped by $\mathcal{B}$, that
is, $S = \{s \mid s = \mathcal{B}(\pi) \text{ for some } \pi\}$ is the image of $\mathcal{B}$, and
let $|S|$ denote the cardinality of the set.

The conditional probability of a label sequence $l$ is given by

\[ p(l|a) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi|a), \tag{7} \]

which is the sum of probabilities of all the paths mapped to a label sequence $l$ by
$\mathcal{B}$.

The cross-entropy loss assigns a negative log probability to the correct answer. Given a target
sequence $z$, the loss function of CTC for the sample is written as

\[ \mathcal{L}(a) = -\log p(z|a). \tag{8} \]

From the description above, CTC is composed of the sum of the product of softmax components.
The function $-\log(y^t_k)$, corresponding to the softmax with cross-entropy loss, is convex.
Therefore, $y^t_k$ is log-concave. Whereas log-concavity is closed under multiplication, the sum
of log-concave functions is not log-concave in general. As a result, the CTC objective is not
convex in general because it contains the sum of softmax components in Eq. (7).
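Equations (5) to (8) can be checked by brute force on a tiny example. The sketch below
enumerates all $K^T$ paths, which is feasible only for toy sizes; practical implementations use
the forward-backward recursion instead. The choice of index 0 as the blank and the helper names
are assumptions made for illustration.

```python
import itertools
import numpy as np

BLANK = 0  # assumption: label index 0 plays the role of the blank

def collapse(path):
    """Operator B: merge repeated labels, then drop blanks."""
    merged = [k for k, _ in itertools.groupby(path)]
    return tuple(k for k in merged if k != BLANK)

def softmax_rows(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def label_sequence_probs(a):
    """Return {label sequence l: p(l|a)} by enumerating every path."""
    T, K = a.shape
    y = softmax_rows(a)                                            # Eq. (5)
    probs = {}
    for path in itertools.product(range(K), repeat=T):
        p_path = np.prod([y[t, k] for t, k in enumerate(path)])   # Eq. (6)
        l = collapse(path)
        probs[l] = probs.get(l, 0.0) + p_path                      # Eq. (7)
    return probs

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3))      # T = 4 timesteps, K = 3 outputs (blank + 2 labels)
probs = label_sequence_probs(a)
ctc_loss = -np.log(probs[(1, 2)])    # Eq. (8) for the target z = (1, 2)
```

Because the paths are mutually exclusive and exhaustive, the probabilities over all label
sequences in $S$ sum to one.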


4.2 Reformulation of CTC objective function 


We reformulate the CTC objective Eq. (8) to separate out the terms that are responsible for the
non-convexity of the function. By reformulation, the softmax function is defined over the
categorical label sequences.

By substituting Eq. (5) into Eq. (6), it follows that

\[ p(\pi|a) = \frac{\exp(b_\pi)}{\sum_{\pi' \in \text{all}} \exp(b_{\pi'})}, \tag{9} \]

where $b_\pi = \sum_t a^t_{\pi_t}$. By substituting Eq. (9) into Eq. (7) and setting $l = z$,
$p(z|a)$ can be re-written as

\[ p(z|a) = \frac{\sum_{\pi \in \mathcal{B}^{-1}(z)} \exp(b_\pi)}{\sum_{\pi \in \text{all}} \exp(b_\pi)} = \frac{\exp(f_z)}{\sum_{z' \in S} \exp(f_{z'})}, \tag{10} \]

where $S$ is the set of every possible label sequence and
$f_z = \log \left( \sum_{\pi \in \mathcal{B}^{-1}(z)} \exp(b_\pi) \right)$ is the log-sum-exp
function, which is proportional to the probability of observing the label sequence $z$ among all
the other label sequences. Here, $f(x_1, \ldots, x_n) = \log(e^{x_1} + \cdots + e^{x_n})$ is the
log-sum-exp function defined on $\mathbb{R}^n$.

With the reformulation above, the CTC objective can be regarded as the cross-entropy loss with the
softmax output, which is defined over all the possible label sequences. Because the cross-entropy
loss function matches the softmax output layer, the CTC objective is convex, except for the part
that computes $f_z$ for each of the label sequences. At this point, an obvious candidate for the
convex approximation of CTC is the GGN matrix separating the convex and non-convex parts.

Let the non-convex part be $\mathcal{N}_c$ and the convex part be $\mathcal{L}_c$. The mapping
$\mathcal{N}_c: (\mathbb{R}^K)^T \to \mathbb{R}^{|S|}$ is defined by

\[ \mathcal{N}_c(a) = F = [f_{z_1}, \ldots, f_{z_{|S|}}]^\top, \tag{11} \]

where $f_z$ is given above, and $|S|$ is the number of all the possible label sequences. For
given $F$ as above, the mapping $\mathcal{L}_c: \mathbb{R}^{|S|} \to \mathbb{R}$ is defined by

\[ \mathcal{L}_c(F) = -\log \frac{\exp(f_z)}{\sum_{z' \in S} \exp(f_{z'})} = -f_z + \log \left( \sum_{z' \in S} \exp(f_{z'}) \right), \tag{12} \]

where $z$ is the label sequence corresponding to $a$. The final reformulation for the loss
function of CTC is given by

\[ \mathcal{L}(a) = \mathcal{L}_c \circ \mathcal{N}_c(a). \tag{13} \]
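The reformulation can be made concrete in the score domain. The following sketch computes the
path scores $b_\pi$ and the per-sequence values $f_z$ by enumeration (toy sizes only), then
evaluates Eq. (12). It also compares the log-partition term $\log \sum_{z'} \exp(f_{z'})$ with
$\sum_t \log \sum_k \exp(a^t_k)$, which follows from the softmax denominators in Eq. (5). The
blank index and helper names are illustrative assumptions.

```python
import itertools
import numpy as np

BLANK = 0  # assumption: label index 0 plays the role of the blank

def collapse(path):
    """Operator B: merge repeated labels, then drop blanks."""
    merged = [k for k, _ in itertools.groupby(path)]
    return tuple(k for k in merged if k != BLANK)

def logsumexp(x):
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def sequence_scores(a):
    """Return {z: f_z}, where f_z is the log-sum-exp of b_pi over paths in B^{-1}(z)."""
    T, K = a.shape
    scores = {}
    for path in itertools.product(range(K), repeat=T):
        b_pi = sum(a[t, k] for t, k in enumerate(path))   # b_pi = sum_t a^t_{pi_t}
        scores.setdefault(collapse(path), []).append(b_pi)
    return {z: logsumexp(bs) for z, bs in scores.items()}

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3))
f = sequence_scores(a)                                   # the vector F of Eq. (11)
z = (1, 2)
log_partition = logsumexp(list(f.values()))
loss_c = -f[z] + log_partition                           # Eq. (12)
per_timestep = sum(logsumexp(a[t]) for t in range(a.shape[0]))
```

Only the computation of `f` involves the many-to-one path structure; the remaining step is an
ordinary softmax cross-entropy over label sequences.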


4.3 Convex approximation of CTC loss function 

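The GGN approximation of Eq. (13) immediately gives a convex approximation of the Hessian for
CTC as $G_{\mathcal{L}_c \circ \mathcal{N}_c} = J_{\mathcal{N}_c}^\top H_{\mathcal{L}_c} J_{\mathcal{N}_c}$.
Although $H_{\mathcal{L}_c}$ has the form of a diagonal matrix plus a rank-1 matrix, i.e.,
$\mathrm{diag}(Y) - Y Y^\top$ with $Y$ the softmax of $F$, the dimension of $H_{\mathcal{L}_c}$ is
$|S| \times |S|$, where $|S|$ becomes exponentially large as the length of the sequence increases.
This makes the practical calculation of $H_{\mathcal{L}_c}$ difficult.

On the other hand, removing the linear term $-f_z$ from $\mathcal{L}_c(F)$ in Eq. (12) does not
alter its Hessian. The resulting formula is
$\mathcal{L}_p(F) = \log \left( \sum_{z' \in S} \exp(f_{z'}) \right)$. The GGN matrices of
$\mathcal{L} = \mathcal{L}_c \circ \mathcal{N}_c$ and $\mathcal{M} = \mathcal{L}_p \circ \mathcal{N}_c$
are the same, i.e., $G_{\mathcal{L}_c \circ \mathcal{N}_c} = G_{\mathcal{L}_p \circ \mathcal{N}_c}$.
Therefore, their Hessian matrices are approximations of each other. The condition under which the
two Hessian matrices, $H_\mathcal{L}$ and $H_\mathcal{M}$, converge to the same matrix is
discussed below.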

Interestingly, $\mathcal{M}$ is given as a compact formula
$\mathcal{M}(a) = \mathcal{L}_p \circ \mathcal{N}_c(a) = \sum_t \log \sum_k \exp(a^t_k)$, where
$a^t_k$ is the output unit $k$ at time $t$. Its Hessian $H_\mathcal{M}$ can be directly computed,
resulting in a block diagonal matrix. Each block is restricted in time, and the $t$-th block is
given by

\[ H_{\mathcal{M},t} = \mathrm{diag}(Y^t) - Y^t {Y^t}^\top, \tag{14} \]

where $Y^t = [y^t_1, \ldots, y^t_K]^\top$ and $y^t_k$ is given in Eq. (5). Because the Hessian of
each block is positive semidefinite, $H_\mathcal{M}$ is positive semidefinite. A convex
approximation of the Hessian of an MDRNN using the CTC objective can be obtained by substituting
$H_\mathcal{M}$ for $H_\mathcal{L}$ in Eq. (3). Note that the resulting matrix is block diagonal
and Eq. (4) can be utilized for efficient computation.
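As a numerical sanity check of Eq. (14), the sketch below compares the closed-form block with a
central finite-difference Hessian of $\mathcal{M}(a) = \sum_t \log \sum_k \exp(a^t_k)$ and
inspects its smallest eigenvalue. The step size `eps` and the toy dimensions are arbitrary
choices for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def M(a):
    """M(a) = sum_t log sum_k exp(a^t_k) for a of shape (T, K)."""
    m = a.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(a - m).sum(axis=1))))

def hessian_block(a_t):
    """Closed-form block H_{M,t} = diag(Y^t) - Y^t Y^t^T from Eq. (14)."""
    y = softmax(a_t)
    return np.diag(y) - np.outer(y, y)

def numerical_block(a, t, eps=1e-4):
    """Central finite-difference Hessian of M with respect to a^t."""
    K = a.shape[1]
    H = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            def shifted(si, sj):
                b = a.copy()
                b[t, i] += si * eps
                b[t, j] += sj * eps
                return M(b)
            H[i, j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * eps ** 2)
    return H

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3))
H_exact = hessian_block(a[1])
H_numeric = numerical_block(a, t=1)
min_eigenvalue = np.linalg.eigvalsh(H_exact).min()
```

Each block has a zero eigenvalue along the all-ones direction, since adding a constant to every
$a^t_k$ leaves the softmax unchanged; the block is therefore positive semidefinite but not
positive definite.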

Our derivation can be summarized as follows: 

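1. $H_\mathcal{L} = H_{\mathcal{L}_c \circ \mathcal{N}_c}$ is not positive semidefinite.

2. $G_{\mathcal{L}_c \circ \mathcal{N}_c} = G_{\mathcal{L}_p \circ \mathcal{N}_c}$ is positive semidefinite, but not computationally tractable.

3. $H_{\mathcal{L}_p \circ \mathcal{N}_c}$ is positive semidefinite and computationally tractable.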

4.4 Sufficient condition for the proposed approximation to be exact 

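From Eq. (2), the condition
$H_{\mathcal{L}_c \circ \mathcal{N}_c} = H_{\mathcal{L}_p \circ \mathcal{N}_c}$ holds if and only
if $\sum_{i=1}^{|S|} [J_{\mathcal{L}_c}]_i H_{[\mathcal{N}_c]_i} = \sum_{i=1}^{|S|} [J_{\mathcal{L}_p}]_i H_{[\mathcal{N}_c]_i}$.
Since $J_{\mathcal{L}_c} \neq J_{\mathcal{L}_p}$ in general, we consider only the case of
$H_{[\mathcal{N}_c]_i} = 0$ for all $i$, which corresponds to the case where $\mathcal{N}_c$ is a
linear mapping.

$[\mathcal{N}_c]_i$ contains a log-sum-exp function mapping from paths to a label sequence. Let
$l$ be the label sequence corresponding to $[\mathcal{N}_c]_i$; then,
$[\mathcal{N}_c]_i = f_l(\ldots, b_\pi, \ldots)$ for $\pi \in \mathcal{B}^{-1}(l)$. If the
probability of one path $\pi'$ is sufficiently large to ignore all the other paths, that is,
$\exp(b_{\pi'}) \gg \exp(b_\pi)$ for $\pi \in \mathcal{B}^{-1}(l) \setminus \{\pi'\}$, it follows
that $f_l(\ldots, b_{\pi'}, \ldots) = b_{\pi'}$. This is a linear mapping, which results in
$H_{[\mathcal{N}_c]_i} = 0$.

In conclusion, the condition
$H_{\mathcal{L}_c \circ \mathcal{N}_c} = H_{\mathcal{L}_p \circ \mathcal{N}_c}$ holds if one
dominant path $\pi \in \mathcal{B}^{-1}(l)$ exists such that $f_l(\ldots, b_\pi, \ldots) = b_\pi$
for each label sequence $l$.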
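The dominant-path argument can be illustrated numerically. The sketch evaluates the log-sum-exp
function and its Hessian with respect to the path scores, once for comparable scores and once
when a single score dominates. The score values are made up for illustration.

```python
import numpy as np

def logsumexp_and_hessian(b):
    """Return f(b) = log sum_i exp(b_i) and its Hessian diag(p) - p p^T, p = softmax(b)."""
    m = b.max()
    w = np.exp(b - m)
    p = w / w.sum()
    return m + np.log(w.sum()), np.diag(p) - np.outer(p, p)

b_mixed = np.array([1.0, 0.8, 1.2])        # several paths with comparable scores
b_dominant = np.array([30.0, 0.8, 1.2])    # one path dominates the others

f_mixed, H_mixed = logsumexp_and_hessian(b_mixed)
f_dominant, H_dominant = logsumexp_and_hessian(b_dominant)
```

With a dominant path, `f_dominant` is numerically equal to the largest score and `H_dominant` is
close to the zero matrix, so the mapping is effectively linear; with comparable scores the
Hessian entries are clearly nonzero.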


4.5 Derivation of the proposed approximation from the Fisher information matrix 

The identity of the GGN and the Fisher information matrix has been shown for networks using the
softmax with cross-entropy loss. Thus, it follows that the GGN matrix of Eq. (13) is identical to
the Fisher information matrix. Now, we show that the proposed matrix in Eq. (14) is derived from
the Fisher information matrix under the condition given in section 4.4. The Fisher information
matrix of an MDRNN using CTC is written as

\[ F = \mathbb{E}_x \left[ J_\mathcal{N}^\top \, \mathbb{E}_{l \sim p(l|a)} \left[ \left( \frac{\partial \log p(l|a)}{\partial a} \right)^\top \left( \frac{\partial \log p(l|a)}{\partial a} \right) \right] J_\mathcal{N} \right], \tag{15} \]

where $a = a(x, \theta)$ is the $KT$-dimensional output of the network $\mathcal{N}$. CTC assumes
output probabilities at each timestep to be independent of those at other timesteps, and
therefore, its Fisher information matrix is given as the sum over every timestep. It follows that

\[ F = \mathbb{E}_x \left[ \sum_t J_{\mathcal{N}_t}^\top \, \mathbb{E}_{l \sim p(l|a)} \left[ \left( \frac{\partial \log p(l|a)}{\partial a^t} \right)^\top \left( \frac{\partial \log p(l|a)}{\partial a^t} \right) \right] J_{\mathcal{N}_t} \right]. \tag{16} \]

Under the condition in section 4.4, the Fisher information matrix is given by

\[ F = \mathbb{E}_x \left[ \sum_t J_{\mathcal{N}_t}^\top \left( \mathrm{diag}(Y^t) - Y^t {Y^t}^\top \right) J_{\mathcal{N}_t} \right], \tag{17} \]

which is the same form as Eqs. (4) and (14) combined. See appendix B for the detailed derivation.


4.6 EM interpretation of the proposed approximation 


The goal of the Expectation-Maximization (EM) algorithm is to find the maximum likelihood
solution for models having latent variables. Given an input sequence $x$ and its corresponding
target label sequence $z$, the log likelihood of $z$ is given by
$\log p(z|x, \theta) = \log \sum_{\pi \in \mathcal{B}^{-1}(z)} p(\pi|x, \theta)$, where $\theta$
represents the model parameters. For each observation $x$, we have a corresponding latent
variable $q$, which is a 1-of-$k$ binary vector where $k$ is the number of all the paths mapped to
$z$. The log likelihood can be written in terms of $q$ as
$\log p(z, q|x, \theta) = \sum_{\pi \in \mathcal{B}^{-1}(z)} q_{\pi|x,z} \log p(\pi|x, \theta)$.
The EM algorithm starts with an initial parameter $\hat{\theta}$, and repeats the following
process until convergence.

Expectation step calculates: $\gamma_{\pi|x,z} = \frac{p(\pi|x, \hat{\theta})}{\sum_{\pi' \in \mathcal{B}^{-1}(z)} p(\pi'|x, \hat{\theta})}$.

Maximization step updates: $\hat{\theta} = \mathrm{argmax}_\theta \, Q(\theta)$, where
$Q(\theta) = \sum_{\pi \in \mathcal{B}^{-1}(z)} \gamma_{\pi|x,z} \log p(\pi|x, \theta)$.

In the context of CTC and RNN, $p(\pi|x, \theta)$ is given as $p(\pi|a(x, \theta))$ as in
Eq. (6), where $a(x, \theta)$ is the $KT$-dimensional output of the neural network. Taking the
second-order derivative of $\log p(\pi|a)$ with respect to $a^t$ gives, up to sign,
$\mathrm{diag}(Y^t) - Y^t {Y^t}^\top$, with $Y^t$ as in Eq. (14). Because this term is independent
of $\pi$ and $\sum_{\pi \in \mathcal{B}^{-1}(z)} \gamma_{\pi|x,z} = 1$, the Hessian of $Q$ with
respect to $a^t$ is given, again up to sign, by

\[ H_{Q,t} = \mathrm{diag}(Y^t) - Y^t {Y^t}^\top, \tag{18} \]

which is the same as the convex approximation in Eq. (14).


5 Experiments 

In this section, we present the experimental results for two different sequence labeling tasks, offline 
handwriting recognition and phoneme recognition. The performance of Hessian-free optimization 
for MDRNNs with the proposed matrix is compared with that of stochastic gradient descent (SGD) 
optimization on the same settings. 

5.1 Database and preprocessing 

The IFN/ENIT Database [20] is a database of handwritten Arabic words, which consists of 32,492
images. The entire dataset has five subsets (a, b, c, d, e). The 25,955 images corresponding to
the subsets (b-e) were used for training. The validation set consisted of 3,269 images
corresponding to the first half of the sorted list in alphabetical order (ae07_001.tif to
ai54_028.tif) in set a. The remaining images in set a, amounting to 3,268, were used for the
test. The intensity of pixels was centered and scaled using the mean and standard deviation
calculated from the training set.

The TIMIT corpus is a benchmark database for evaluating speech recognition performance.
The standard training, validation, and core datasets were used, containing 3,696 sentences,
400 sentences, and 192 sentences, respectively. A mel spectrum with 26 coefficients was used as
a feature vector with a pre-emphasis filter, 25 ms window size, and 10 ms shift size. Each input
feature was centered and scaled using the mean and standard deviation of the training set.

5.2 Experimental setup 

For handwriting recognition, the basic architecture was adopted from a previously proposed
network. Deeper networks were constructed by replacing the top layer with more layers. The number
of LSTM cells in the augmented layer was chosen such that the total number of weights between the
different networks was similar. The detailed architectures are described in Table 1 together with
the results.

For phoneme recognition, the deep bidirectional LSTM and CTC in [3] was adopted as the basic
architecture. In addition, the memory cell block, in which the cells share the gates, was applied
for efficient information sharing. Each LSTM block was constrained to have 10 memory cells.

According to the results, using a large bias value for the input/output gates is beneficial for
training deep MDRNNs. A possible explanation is that the activation of neurons is exponentially
decayed by the input/output gates during the propagation. Thus, setting large bias values for
these gates may facilitate the transmission of information through many layers at the beginning
of the learning. For this reason, the biases of the input and output gates were initialized to 2,
whereas those of the forget gates and memory cells were initialized to 0. All the other weight
parameters of the MDRNN were initialized randomly from a uniform distribution in the range
[-0.1, 0.1].
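The initialization described above can be sketched as follows. The parameter layout (one input
and one recurrent weight matrix per gate, stored in a dictionary) and the function name
`init_lstm_layer` are assumptions for illustration; the paper does not specify a data structure.

```python
import numpy as np

def init_lstm_layer(n_in, n_cells, rng):
    """Initialize one LSTM layer: gate biases as in the text, weights in [-0.1, 0.1]."""
    def uniform(*shape):
        return rng.uniform(-0.1, 0.1, size=shape)

    params = {}
    for gate in ("input", "forget", "cell", "output"):
        params[f"W_x_{gate}"] = uniform(n_cells, n_in)      # input-to-hidden weights
        params[f"W_h_{gate}"] = uniform(n_cells, n_cells)   # recurrent weights
    # Large biases keep the input/output gates mostly open at the start of training.
    params["b_input"] = np.full(n_cells, 2.0)
    params["b_output"] = np.full(n_cells, 2.0)
    params["b_forget"] = np.zeros(n_cells)
    params["b_cell"] = np.zeros(n_cells)
    return params

params = init_lstm_layer(n_in=3, n_cells=5, rng=np.random.default_rng(0))
```

With a bias of 2 and small weights, the logistic sigmoid of a gate pre-activation starts near
0.88, so little signal is attenuated per layer at initialization.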

The label error rate was used as the metric for performance evaluation, together with the average
loss of CTC in Eq. (8). The label error rate is defined by the edit distance, which sums the
total number of insertions, deletions, and substitutions required to match two given sequences.
The final performance, shown in Tables 1 and 2, was evaluated using the weight parameters that
gave the best label error rate on the validation set. To map output probabilities to a label
sequence, best path decoding was used for handwriting recognition, and beam search decoding with
a beam width of 100 was used for phoneme recognition. For phoneme recognition, 61 phoneme labels
were used during training and decoding, and were then mapped to 39 classes for calculating the
phoneme error rate (PER).
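A minimal sketch of the edit-distance computation behind the label error rate is given below.
The normalization by the total reference length is an assumption (a common convention); the text
above only states that the metric is defined by the edit distance.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def label_error_rate(refs, hyps):
    """Total edit distance divided by total reference length (assumed normalization)."""
    total = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total / sum(len(r) for r in refs)

rate = label_error_rate([[1, 2, 3, 4]], [[1, 2, 4, 4]])
```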

For phoneme recognition, a previously suggested regularization method was used. We applied
Gaussian weight noise of standard deviation $\sigma = \{0.03, 0.04, 0.05\}$ together with L2
regularization of strength 0.001. The network was first trained without noise, and then it was
initialized to the weights that gave the lowest CTC loss on the validation set. Then, the network
was retrained with Gaussian weight noise. Table 2 presents the best result for different values
of $\sigma$.

5.2.1 Parameters 

For HF optimization, we followed a previously described basic setup, but different parameters
were utilized. Tikhonov damping was used together with Levenberg-Marquardt heuristics. The value
of the damping parameter $\lambda$ was initialized to 0.1, and adjusted according to the
reduction ratio $\rho$ (multiplied by 0.9 if $\rho > 0.75$, divided by 0.9 if $\rho < 0.25$, and
unchanged otherwise). The initial search direction for each run of CG was set to the CG direction
found by the previous HF optimization iteration decayed by 0.7. To ensure that CG followed the
descent direction, we continued to perform a minimum of 5 and a maximum of 30 additional CG
iterations after it found the first descent direction. We terminated CG at iteration $i$ before
reaching the maximum iteration if the following condition was satisfied:
$(\phi(x_i) - \phi(x_{i-5})) / \phi(x_i) < 0.005$, where $\phi$ is the quadratic objective of CG
without offset. The training data were divided into 100 and 50 mini-batches for the handwriting
and phoneme recognition experiments, respectively, and used for both the gradient and
matrix-vector product calculation. The learning was stopped if either of the two criteria did not
improve for 20 epochs and 10 epochs in handwriting and phoneme recognition, respectively.
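The damping adjustment and the CG termination test can be written as two small helpers. This is a
sketch under stated assumptions: the function names are illustrative, $\rho$ is taken to be the
usual ratio of actual to predicted reduction, and the guard against a zero objective value is an
addition for numerical safety that the text does not mention.

```python
def update_damping(lam, rho, factor=0.9, low=0.25, high=0.75):
    """Levenberg-Marquardt style update of the damping parameter, as described above."""
    if rho > high:
        return lam * factor      # quadratic model is trustworthy: reduce damping
    if rho < low:
        return lam / factor      # quadratic model is poor: increase damping
    return lam

def cg_should_stop(phi_values, lookback=5, tol=0.005):
    """Relative-progress test on the CG quadratic objective (without offset).

    phi_values[i] is the objective value at CG iteration i.
    """
    i = len(phi_values) - 1
    if i < lookback or phi_values[i] == 0:
        return False
    return (phi_values[i] - phi_values[i - lookback]) / phi_values[i] < tol

lam = 0.1
lam_after_good_step = update_damping(lam, rho=0.9)
lam_after_poor_step = update_damping(lam, rho=0.1)
```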

For SGD optimization, the learning rate $\epsilon$ was chosen from three candidate values, each a
negative power of ten, and the momentum $\mu$ from {0.9, 0.95, 0.99}. For handwriting
recognition, the best performance obtained using all the possible combinations of parameters is
presented in Table 1. For phoneme recognition, the best parameters out of nine candidates for
each network were selected after training without weight noise based on the CTC loss.
Additionally, the backpropagated error in the LSTM layer was clipped to remain in the range
[-1, 1] for stable learning. The learning was stopped after 1000 epochs had been processed, and
the final performance was evaluated using the weight parameters that showed the best label error
rate on the validation set. It should be noted that, in order to guarantee convergence, we
selected a conservative criterion compared to previous studies, in which the network converged
after 85 epochs in handwriting recognition and after 55-150 epochs in phoneme recognition [3].

5.3 Results 

Table 1 presents the label error rate on the test set for handwriting recognition. In all cases,
the networks trained using HF optimization outperformed those using SGD. The advantage of using
HF is more pronounced as the depth increases. The improvements resulting from the deeper
architecture can be seen with the error rate dropping from 6.1% to 4.5% as the depth increases
from 3 to 13.

Table 2 shows the phoneme error rate (PER) on the core set for phoneme recognition. The improved
performance according to the depth can be observed for both optimization methods. The best PER
for HF optimization is 18.54% at 15 layers and that for SGD is 18.46% at 10 layers, which are
comparable to that reported in [3], where the reported results are a PER of 18.6% from a network
with 3 layers having 3.8 million weights and a PER of 18.4% from a network with 5 layers having
6.8 million weights. The benefit of a deeper network is obvious in terms of the number of weight
parameters, although this is not intended to be a definitive performance comparison because of
the different preprocessing. The advantage of HF optimization is not prominent in the result of
the experiments using the TIMIT database. One explanation is that the networks tend to overfit
to a relatively small number of training data samples, which removes the advantage of using
advanced optimization techniques.

Table 1: Experimental results for Arabic offline handwriting recognition. The label error rate is
presented for the different network depths. $A^B$ denotes a stack of $B$ layers having $A$ hidden
LSTM cells in each layer. "Epochs" is the number of epochs required by the network using HF
optimization so that the stopping criteria are fulfilled. $\epsilon$ is the learning rate and
$\mu$ is the momentum. A learning-rate exponent shown as "?" is not legible in this copy.

NETWORKS      DEPTH   WEIGHTS   HF (%)   EPOCHS   SGD (%)   {ε, μ}
2-10-50       3       159,369   6.10     77       9.57      {10^-?, 0.9}
2-10-21^3     5       157,681   5.85     90       9.19      {10^-?, 0.99}
2-10-14^6     8       154,209   4.98     140      9.67      {10^-?, 0.95}
2-10-12^8     10      154,153   4.95     109      9.25      {10^-?, 0.95}
2-10-10^11    13      150,169   4.50     84       10.63     {10^-?, 0.9}
2-10-9^13     15      145,417   5.69     84       12.29     {10^-?, 0.99}


Table 2: Experimental results for phoneme recognition using the TIMIT corpus. PER is presented
for the different MDRNN architectures (depth × block × cell/block). $\sigma$ is the standard
deviation of Gaussian weight noise. The remaining parameters are the same as in Table 1.

NETWORKS        WEIGHTS   HF (%)   EPOCHS   {σ}      SGD (%)   {ε, μ, σ}
3 × 20 × 10     771,542   20.14    22       {0.03}   20.96     {10^-?, 0.99, 0.05}
5 × 15 × 10     795,752   19.18    30       {0.05}   20.82     {10^-?, 0.9, 0.04}
8 × 11 × 10     720,826   19.09    29       {0.05}   19.68     {10^-?, 0.9, 0.04}
10 × 10 × 10    755,822   18.79    60       {0.04}   18.46     {10^-?, 0.95, 0.04}
13 × 9 × 10     806,588   18.59    93       {0.05}   18.49     {10^-?, 0.95, 0.04}
15 × 8 × 10     741,230   18.54    50       {0.04}   19.09     {10^-?, 0.95, 0.03}
3 × 250 × 1†    3.8M      -        -        -        18.6      {10^-?, 0.9, 0.075}
5 × 250 × 1†    6.8M      -        -        -        18.4      {10^-?, 0.9, 0.075}

† The results were reported by Graves in 2013 [3].

6 Conclusion 


Hessian-free optimization as an approach for successful learning of deep MDRNNs, in conjunction
with CTC, was presented. To apply HF optimization to CTC, a convex approximation of its objective
function was explored. In experiments, improvements in performance were seen as the depth of the
network increased for both HF and SGD. HF optimization showed a significantly better performance
for handwriting recognition than did SGD, and a comparable performance for speech recognition.

A Derivation of the $\mathcal{R}$ operator to LSTM

We follow a previously published version of LSTM. The forward pass of LSTM is given by

\[
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f), \\
c_t &= f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o), \\
h_t &= o_t \cdot \tanh(c_t),
\end{aligned}
\]

where $\cdot$ denotes the element-wise vector product, $\sigma$ is the logistic sigmoid function,
$x$, $h$, and $c$ are the input, hidden, and cell activation vector, respectively, and $i$, $o$,
and $f$ are the input, output, and forget gates, respectively. All the gates and cells are the
same size as the hidden vector $h$.

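Applying the $\mathcal{R}$ operator to the above equations gives

\[
\begin{aligned}
\mathcal{R}_v(i_t) &= \sigma'(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
&\quad \cdot (V_{xi} x_t + V_{hi} h_{t-1} + V_{ci} c_{t-1} + V_i + W_{hi} \mathcal{R}_v(h_{t-1}) + W_{ci} \mathcal{R}_v(c_{t-1})), \\
\mathcal{R}_v(f_t) &= \sigma'(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
&\quad \cdot (V_{xf} x_t + V_{hf} h_{t-1} + V_{cf} c_{t-1} + V_f + W_{hf} \mathcal{R}_v(h_{t-1}) + W_{cf} \mathcal{R}_v(c_{t-1})), \\
\mathcal{R}_v(c_t) &= \mathcal{R}_v(f_t) \cdot c_{t-1} + f_t \cdot \mathcal{R}_v(c_{t-1}) + \mathcal{R}_v(i_t) \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
&\quad + i_t \cdot \tanh'(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \cdot (V_{xc} x_t + V_{hc} h_{t-1} + V_c + W_{hc} \mathcal{R}_v(h_{t-1})), \\
\mathcal{R}_v(o_t) &= \sigma'(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
&\quad \cdot (V_{xo} x_t + V_{ho} h_{t-1} + V_{co} c_t + V_o + W_{ho} \mathcal{R}_v(h_{t-1}) + W_{co} \mathcal{R}_v(c_t)), \\
\mathcal{R}_v(h_t) &= \mathcal{R}_v(o_t) \cdot \tanh(c_t) + o_t \cdot \tanh'(c_t) \cdot \mathcal{R}_v(c_t),
\end{aligned}
\]

where $V_{ij}$ and $V_i$ are taken from $v$ at the same point of $W_{ij}$ and $b_i$ in $\theta$,
respectively. Note that $\theta$ and $v$ have the same dimension.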
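The $\mathcal{R}$-operator equations above can be exercised with a small sketch that propagates
$\mathcal{R}_v(h_t)$ and $\mathcal{R}_v(c_t)$ alongside the forward pass and compares the result
with a central finite difference in the direction $v$. The dictionary-based parameter layout, the
full (non-diagonal) peephole matrices, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def random_params(rng, n_in, n_h):
    """Random parameters laid out in a dict (also used to draw the direction v)."""
    p = {}
    for g in ("i", "f", "c", "o"):
        p["Wx" + g] = rng.uniform(-0.5, 0.5, (n_h, n_in))
        p["Wh" + g] = rng.uniform(-0.5, 0.5, (n_h, n_h))
        p["b" + g] = rng.uniform(-0.5, 0.5, n_h)
    for g in ("i", "f", "o"):
        p["Wc" + g] = rng.uniform(-0.5, 0.5, (n_h, n_h))   # cell-to-gate weights
    return p

def step(W, x, h, c, V=None, Rh=None, Rc=None):
    """One LSTM step; if V is given, also propagate R_v(h) and R_v(c)."""
    zi = W["Wxi"] @ x + W["Whi"] @ h + W["Wci"] @ c + W["bi"]
    zf = W["Wxf"] @ x + W["Whf"] @ h + W["Wcf"] @ c + W["bf"]
    zc = W["Wxc"] @ x + W["Whc"] @ h + W["bc"]
    i, f, g = sigmoid(zi), sigmoid(zf), np.tanh(zc)
    c_new = f * c + i * g
    zo = W["Wxo"] @ x + W["Who"] @ h + W["Wco"] @ c_new + W["bo"]
    o = sigmoid(zo)
    h_new = o * np.tanh(c_new)
    if V is None:
        return h_new, c_new
    Rzi = V["Wxi"] @ x + V["Whi"] @ h + V["Wci"] @ c + V["bi"] + W["Whi"] @ Rh + W["Wci"] @ Rc
    Rzf = V["Wxf"] @ x + V["Whf"] @ h + V["Wcf"] @ c + V["bf"] + W["Whf"] @ Rh + W["Wcf"] @ Rc
    Rzc = V["Wxc"] @ x + V["Whc"] @ h + V["bc"] + W["Whc"] @ Rh
    Ri = i * (1 - i) * Rzi                       # sigma'(z) = sigma(z)(1 - sigma(z))
    Rf = f * (1 - f) * Rzf
    Rc_new = Rf * c + f * Rc + Ri * g + i * (1 - g ** 2) * Rzc
    Rzo = (V["Wxo"] @ x + V["Who"] @ h + V["Wco"] @ c_new + V["bo"]
           + W["Who"] @ Rh + W["Wco"] @ Rc_new)
    Ro = o * (1 - o) * Rzo
    tc = np.tanh(c_new)
    Rh_new = Ro * tc + o * (1 - tc ** 2) * Rc_new
    return h_new, c_new, Rh_new, Rc_new

def run(W, X, V=None):
    """Run the LSTM over X; return h_T, plus R_v(h_T) when a direction V is given."""
    n_h = W["bi"].size
    h, c = np.zeros(n_h), np.zeros(n_h)
    Rh, Rc = np.zeros(n_h), np.zeros(n_h)
    for x in X:
        if V is None:
            h, c = step(W, x, h, c)
        else:
            h, c, Rh, Rc = step(W, x, h, c, V, Rh, Rc)
    return h if V is None else (h, Rh)

rng = np.random.default_rng(0)
W = random_params(rng, n_in=3, n_h=4)
V = random_params(rng, n_in=3, n_h=4)            # direction v, same layout as theta
X = rng.standard_normal((5, 3))
h_T, Rh_T = run(W, X, V)

eps = 1e-5
W_plus = {k: W[k] + eps * V[k] for k in W}
W_minus = {k: W[k] - eps * V[k] for k in W}
Rh_fd = (run(W_plus, X) - run(W_minus, X)) / (2 * eps)   # finite-difference reference
```

The finite-difference check costs two extra forward passes per direction, so it is useful for
testing only; the $\mathcal{R}$ pass yields the same quantity in a single sweep.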


B Detailed derivation of the proposed approximation from the Fisher 
information matrix 


The derivative of the negative log probability of Eq. (8) is given by

\[ -\frac{\partial \log p(l|a)}{\partial a^t_k} = y^t_k - \frac{1}{p(l|a)} \sum_{s \in lab(l,k)} \alpha_t(s) \beta_t(s), \tag{19} \]

where $\alpha_t(s)$ and $\beta_t(s)$ denote forward and backward variables, respectively, and
$lab(l, k) = \{s \mid l_s = k\}$ is the set of positions where label $k$ occurs in $l$. For
compact notation, let $Y^t$ denote a column matrix containing $y^t_k$ as its $k$-th element, and
let $V^t$ denote a column matrix containing
$v^t_k = \frac{1}{p(l|a)} \sum_{s \in lab(l,k)} \alpha_t(s) \beta_t(s)$ as its $k$-th element.

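The Fisher information matrix [15] is defined by

\[ F = \mathbb{E}_x \left[ \mathbb{E}_{l \sim p(l|x)} \left[ \left( \frac{\partial \log p(l|x, \theta)}{\partial \theta} \right)^\top \left( \frac{\partial \log p(l|x, \theta)}{\partial \theta} \right) \right] \right]. \tag{20} \]

The Fisher information matrix of an MDRNN using CTC is written as

\[
\begin{aligned}
F &= \mathbb{E}_x \left[ \mathbb{E}_{l \sim p(l|a)} \left[ J_\mathcal{N}^\top \left( \frac{\partial \log p(l|a)}{\partial a} \right)^\top \left( \frac{\partial \log p(l|a)}{\partial a} \right) J_\mathcal{N} \right] \right] && (21) \\
&= \mathbb{E}_x \left[ J_\mathcal{N}^\top \, \mathbb{E}_{l \sim p(l|a)} \left[ \left( \frac{\partial \log p(l|a)}{\partial a} \right)^\top \left( \frac{\partial \log p(l|a)}{\partial a} \right) \right] J_\mathcal{N} \right], && (22)
\end{aligned}
\]

where $a = a(x, \theta)$ is the $KT$-dimensional output of the network $\mathcal{N}$. The final
step follows from the fact that $J_\mathcal{N}$ is independent of $l$.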
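
CTC assumes the output probabilities at each timestep to be independent of those at other
timesteps, and therefore, its Fisher information matrix is given as the sum over every timestep.
It follows that

\[
\begin{aligned}
F &= \mathbb{E}_x \left[ \sum_t J_{\mathcal{N}_t}^\top \, \mathbb{E}_{l \sim p(l|a)} \left[ \left( \frac{\partial \log p(l|a)}{\partial a^t} \right)^\top \left( \frac{\partial \log p(l|a)}{\partial a^t} \right) \right] J_{\mathcal{N}_t} \right] && (23) \\
&= \mathbb{E}_x \left[ \sum_t J_{\mathcal{N}_t}^\top \, \mathbb{E}_l \left[ (Y^t - V^t)(Y^t - V^t)^\top \right] J_{\mathcal{N}_t} \right] && (24) \\
&= \mathbb{E}_x \left[ \sum_t J_{\mathcal{N}_t}^\top \left( Y^t {Y^t}^\top - Y^t \mathbb{E}_l[V^t]^\top - \mathbb{E}_l[V^t] {Y^t}^\top + \mathbb{E}_l[V^t {V^t}^\top] \right) J_{\mathcal{N}_t} \right], && (25)
\end{aligned}
\]

where $Y^t$ and $V^t$ are defined above.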


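$\mathbb{E}_l[v^t_k]$ is given by

\[
\begin{aligned}
\mathbb{E}_l[v^t_k] &= \mathbb{E}_{l \sim p(l|a)} \left[ \frac{1}{p(l|a)} \sum_{s \in lab(l,k)} \alpha_t(s) \beta_t(s) \right] && (26) \\
&= \sum_l \sum_{s \in lab(l,k)} \alpha_t(s) \beta_t(s) && (27) \\
&= y^t_k. && (28)
\end{aligned}
\]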

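$\mathbb{E}_l[v^t_i v^t_j]$ is given by

\[ \mathbb{E}_l[v^t_i v^t_j] = \mathbb{E}_{l \sim p(l|a)} \left[ \frac{1}{p(l|a)^2} \sum_{s \in lab(l,i)} \alpha_t(s) \beta_t(s) \sum_{s \in lab(l,j)} \alpha_t(s) \beta_t(s) \right]. \tag{29} \]

Unfortunately, Eq. (29) cannot be analytically calculated in general. We apply the sufficient
condition for the proposed approximation to be exact in section 4.4. By the assumption of one
dominant path in a label sequence, $\mathbb{E}_l[v^t_i v^t_j] = 0$ for $i \neq j$. If the
dominant path visits $i$ at time $t$, $\sum_{s \in lab(l,i)} \alpha_t(s) \beta_t(s) = p(l|a)$.
Otherwise, $\sum_{s \in lab(l,i)} \alpha_t(s) \beta_t(s) = 0$. Under this condition, Eq. (29) can
be written as

\[
\begin{aligned}
\mathbb{E}_l[v^t_i v^t_j] &= \delta_{ij} \sum_l \sum_{s \in lab(l,i)} \alpha_t(s) \beta_t(s) && (30) \\
&= \delta_{ij} \, y^t_i, && (31)
\end{aligned}
\]

where $\delta_{ij}$ is the Kronecker delta. Substituting $\mathbb{E}_l[V^t] = Y^t$ and
$\mathbb{E}_l[V^t {V^t}^\top] = \mathrm{diag}(Y^t)$ into Eq. (25) gives

\[ F = \mathbb{E}_x \left[ \sum_t J_{\mathcal{N}_t}^\top \left( \mathrm{diag}(Y^t) - Y^t {Y^t}^\top \right) J_{\mathcal{N}_t} \right], \tag{32} \]

which is the same form as Eqs. (4) and (14) combined.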